In October 2012, the US government's Center for Medicare and Medicaid Services (CMS) began reducing Medicare payments for Inpatient Prospective Payment System hospitals with excess readmissions. Excess readmissions are measured by a ratio, by dividing a hospital’s number of “predicted” 30-day readmissions for heart attack, heart failure, and pneumonia by the number that would be “expected,” based on an average hospital with similar patients. A ratio greater than 1 indicates excess readmissions.
In this exercise, you will:
More instructions provided below. Include your work in this notebook and submit to your Github account.
In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#import bokeh.plotting as bkp
from mpl_toolkits.axes_grid1 import make_axes_locatable
%matplotlib inline
import statsmodels.api as sm
#import statsmodels.formula.api as smf
#import statsmodels.graphics as sg
import math
from scipy import stats
from __future__ import division
In [2]:
# read in readmissions data provided
hospital_read_df = pd.read_csv('data/cms_hospital_readmissions.csv')
In [3]:
# deal with missing and inconvenient portions of data
clean_hospital_read_df = hospital_read_df[(hospital_read_df['Number of Discharges'] != 'Not Available')]
clean_hospital_read_df.loc[:, 'Number of Discharges'] = clean_hospital_read_df['Number of Discharges'].astype(int)
clean_hospital_read_df = clean_hospital_read_df.sort('Number of Discharges')
In [5]:
# generate a scatterplot for number of discharges vs. excess rate of readmissions
# lists work better with matplotlib scatterplot function
x = [a for a in clean_hospital_read_df['Number of Discharges'][81:-3]]
y = list(clean_hospital_read_df['Excess Readmission Ratio'][81:-3])
fig, ax = plt.subplots(figsize=(8,5))
ax.scatter(x, y,alpha=0.2)
ax.fill_between([0,350], 1.15, 2, facecolor='red', alpha = .15, interpolate=True)
ax.fill_between([800,2500], .5, .95, facecolor='green', alpha = .15, interpolate=True)
ax.set_xlim([0, max(x)])
ax.set_xlabel('Number of discharges', fontsize=12)
ax.set_ylabel('Excess rate of readmissions', fontsize=12)
ax.set_title('Scatterplot of number of discharges vs. excess rate of readmissions', fontsize=14)
ax.grid(True)
fig.tight_layout()
A. Initial observations based on the plot above
B. Statistics
C. Conclusions
D. Regulatory policy recommendations
Include your work on the following in this notebook and submit to your Github account.
A. Do you agree with the above analysis and recommendations? Why or why not?
B. Provide support for your arguments and your own recommendations with a statistically sound analysis:
You can compose in notebook cells using Markdown:
A. A brief critique of the analysis
B. Statistical analysis of the preliminary report
C. Regression analysis
D. Conclusions
While the correlation is highly statistically significant, the number of discharges doesn't account for that much of the variation in readmission rates. An increase of 1000 discharges only decreases the readmission rate by about 0.03 on average, which is only about 1/3 of a standard deviation of our data readmission rate data. There are clearly many more factors involved with variations in readmission rates, so any policy recommendations that we make should not be given the highest priority until we better understand the causes behind readmission rates and can see if there may be more effective interventions.
We also should do more analysis before jumping to conclusions about policy recommendations. Correlation does not imply causation, and we should try to control for things that are (largely) out of a hospital's control, such as the underlying health of the population. However, these data do not include measures of the underlying health of the population, so we will have to look to other data sources in order to make this analysis.
In [83]:
# generate a scatterplot for number of discharges vs. excess rate of readmissions
# lists work better with matplotlib scatterplot function
x = cleaner_hospital['Number of Discharges'].as_matrix()
y = cleaner_hospital['Excess Readmission Ratio'].as_matrix()
fig, ax = plt.subplots(figsize=(8,5))
ax.scatter(x, y,alpha=0.2,label='source data')
ax.fill_between([0,100], lowdischarge['Excess Readmission Ratio'].min(), lowdischarge['Excess Readmission Ratio'].max(), facecolor='yellow', alpha = .3, interpolate=True, label='low discharge')
ax.fill_between([1000,highdischarge['Number of Discharges'].max()], highdischarge['Excess Readmission Ratio'].min(), highdischarge['Excess Readmission Ratio'].max(), facecolor='green', alpha = .3, interpolate=True,label='high discharge')
linx=np.linspace(0,3000,num=5)
ax.plot(linx, disres.params[0]*linx+disres.params[1],alpha=0.8, color='red',lw=2,label='regression line')
ax.set_xlim([0, 3000]) #There are some big outliers
ax.set_xlabel('Number of discharges', fontsize=12)
ax.set_ylabel('Excess rate of readmissions', fontsize=12)
ax.set_title('Scatterplot of number of discharges vs. excess rate of readmissions', fontsize=14)
handles, labels = ax.get_legend_handles_labels()
p1 = plt.Rectangle((0, 0), 1, 1, fc="yellow",alpha=.3)
p2 = plt.Rectangle((0, 0), 1, 1, fc="green",alpha=.3)
handles=handles+[p1,p2]
labels=labels+['low discharge group','high discharge group']
ax.legend(handles,labels)
ax.grid(True)
fig.tight_layout()
In [11]:
cleaner_hospital=clean_hospital_read_df.dropna(subset=['Number of Discharges','Excess Readmission Ratio'])
In [13]:
dismod=sm.OLS(cleaner_hospital['Excess Readmission Ratio'],sm.add_constant(cleaner_hospital['Number of Discharges'],prepend=False))
In [15]:
disres=dismod.fit()
disres.summary()
Out[15]:
In [16]:
disres.pvalues
Out[16]:
In [25]:
disres.params[0]*2000
Out[25]:
In [12]:
lowdischarge=cleaner_hospital[cleaner_hospital['Number of Discharges']<100]
highdischarge=cleaner_hospital[cleaner_hospital['Number of Discharges']>1000]
In [41]:
SE=math.sqrt(lowdischarge['Excess Readmission Ratio'].std()/lowdischarge['Excess Readmission Ratio'].count()+highdischarge['Excess Readmission Ratio'].std()/highdischarge['Excess Readmission Ratio'].count())
mdiff=highdischarge['Excess Readmission Ratio'].mean()-lowdischarge['Excess Readmission Ratio'].mean()
stats.t.cdf(mdiff/SE,min(lowdischarge['Excess Readmission Ratio'].count(),highdischarge['Excess Readmission Ratio'].count())-1)
Out[41]:
In [38]:
lowprop=(lowdischarge['Excess Readmission Ratio'].map(math.ceil)-1).mean()
highprop=(highdischarge['Excess Readmission Ratio'].map(math.ceil)-1).mean()
print(lowprop)
print(highprop)
In [57]:
groupeddata=lowdischarge.append(highdischarge)
groupprop=(groupeddata['Excess Readmission Ratio'].map(math.ceil)-1).mean()
SEprop=math.sqrt((1-groupprop)*groupprop*(1/lowdischarge.shape[0]+1/highdischarge.shape[0]))
In [58]:
tscore=(lowprop-highprop)/SEprop
tscore
Out[58]:
In [59]:
degfree=min(lowdischarge['Excess Readmission Ratio'].count(),highdischarge['Excess Readmission Ratio'].count())-1
degfree
Out[59]:
In [60]:
(1-stats.t.cdf(tscore,degfree))
Out[60]:
In [32]:
print groupeddata.shape
print lowdischarge.shape
print highdischarge.shape
In [76]:
cleaner_hospital['Excess Readmission Ratio'].hist(bins=30,alpha=0.5,color='purple')
Out[76]:
In [62]:
disres.params[0]*1000/cleaner_hospital['Excess Readmission Ratio'].std()
Out[62]: